Abstract: This paper studies the problem of detecting the
information source in a network in which the spread of information
follows the popular Susceptible-Infected-Recovered (SIR) model. 
We assume all nodes in the network are in the susceptible state 
initially except the information source which is in the infected 
state. Susceptible nodes may then be infected by infected nodes, 
and infected nodes may recover and will not be infected again 
after recovery. Given a snapshot of the network, from which we 
know all infected nodes but cannot distinguish susceptible nodes 
and recovered nodes, the problem is to find the information 
source based on the snapshot and the network topology. We 
develop a sample path based approach where the estimator of the 
information source is chosen to be the root node associated with 
the sample path that most likely leads to the observed snapshot. 
We prove that, for infinite trees, the estimator is a node that minimizes
the maximum distance to the infected nodes. A reverse-infection 
algorithm is proposed to find such an estimator in general graphs. 
We prove that for g-regular trees such that gq > 1, where g is
the node degree and q is the infection probability, the estimator 
is within a constant distance from the actual source with a high 
probability, independent of the number of infected nodes and the 
time the snapshot is taken. Our simulation results show that for 
tree networks, the estimator produced by the reverse-infection 
algorithm is closer to the actual source than the one identified by 
the closeness centrality heuristic. We then further evaluate the 
performance of the reverse infection algorithm on several real 
world networks. 

I. Introduction 

Diffusion processes in networks refer to the spread of 
information throughout the networks, and have been widely 
used to model many real-world phenomena such as the 
outbreak of epidemics, the spreading of gossip over online
social networks, the spreading of computer viruses over the
Internet, and the adoption of innovations. Important properties
of diffusion processes such as the outbreak thresholds [1] and
the impact of network topologies [2] have been intensively
studied. 

In this paper, we are interested in the reverse of the diffusion 
problem: given a snapshot of the diffusion process at time 
t, can we tell which node is the source of the diffusion? 
The answer to this problem has many important applications, 
and can help us answer the following questions: who is the 
rumor source in online social networks? which computer is the 
first one infected by a computer virus? who is the one who 
uploaded contraband materials to the Internet? and where is 
the source of an epidemic? 



We call this problem the information source detection problem.
This problem has been studied in [3]-[5] under the
Susceptible-Infected (SI) model, in which
susceptible nodes may be infected but infected nodes cannot 
recover. The authors formulated the problem as a maximum 
likelihood estimation (MLE) problem, and developed novel 
algorithms to detect the source. 

In this paper, we adopt the Susceptible-Infected-Recovered 
(SIR) model, a standard model of epidemics [6], [7]. The
network is assumed to be an undirected graph and each node in
the network has three possible states: susceptible (S), infected
(I), and recovered (R). Nodes in state S can be infected
and change to state I, and nodes in state I can recover and
change to state R. Recovered nodes cannot be infected again. 
We assume that initially all nodes are in the susceptible state 
except one infected node (called the information source). The 
information source then infects its neighbors, and the information
starts to spread in the network. Now given a snapshot
of the network, in which we can identify infected nodes 
and healthy (susceptible and recovered) nodes (we assume 
susceptible nodes and recovered nodes are indistinguishable), 
the question is which node is the information source. 

We remark that it is very important to take recovery into 
consideration since recovery can happen due to various reasons 
in practice. For example, a contraband material uploader may 
delete the file, a computer may recover from a virus attack after 
anti-virus software removes the virus, and a user may delete 
the rumor from her/his blog. In order to solve the information 
source detection problem in these scenarios, we study the SIR 
model in this paper, which makes the problem significantly 
more challenging than that in the SI model as we will explain 
in the related work section. 

A. Main Results 

The main results of this paper are summarized below. 

• Similar to the SI model, the information source detection
problem can be formalized as an MLE problem. Unfortunately,
to solve the MLE problem, we need to consider
all possible infection sample paths, and for each sample
path, we need to specify the infection time and recovery
time for each healthy node and the infection time for each
infected node, so the number of possible sample paths
is $\Omega(t^N)$, where $N$ is the network size
and $t$ is the time the snapshot is obtained. Therefore, the
MLE problem is difficult to solve even when $t$ is known.
The problem becomes much harder when t is unknown, 
which is the assumption of this paper. To overcome this 
difficulty, we propose a sample path based approach. We 
propose to find the sample path which most likely leads 
to the observed snapshot and view the source associated 
with that sample path as the information source. We 
call this problem the optimal sample path detection problem.
We investigate the structural properties of the optimal
sample path in trees. Defining the infection eccentricity
of a node to be the maximum distance from the node 
to infected nodes, we prove that the source node of 
the optimal sample path is the node with the minimum 
infection eccentricity. Since a node with the minimum 
eccentricity in a graph is called the Jordan center, we 
call the nodes with the minimum infection eccentricity 
the Jordan infection centers. Therefore, the sample path 
based estimator is one of the Jordan infection centers. 

• We propose a low complexity algorithm, called the reverse
infection algorithm, to find the sample path based estimator
in general graphs. In the algorithm, each infected
node broadcasts its identity in the network, and the node that
first collects all identities of infected nodes declares itself
as the information source, breaking ties based on the sum
of distances to infected nodes. The running time of this
algorithm is equal to the minimum infection eccentricity, 
and the number of messages each node receives/sends at 
each iteration is bounded by the degree of the node. 

• We analyze the performance of the reverse infection 
algorithm on $g$-regular trees, and show that the algorithm
can output a node within a constant distance from the 
actual source with a high probability, independent of the 
number of infected nodes and the time the snapshot is 
taken. 

• We conduct extensive simulations over various networks
to verify the performance of the reverse infection algorithm.
The detection rate over regular trees is found to
be around 60%, and is higher than that of the infection
closeness centrality (also called distance centrality) heuristic.
The infection closeness of a node is defined to be
the inverse of the sum of distances to infected nodes, and
the infection closeness centrality heuristic is to claim the
node with the maximum infection closeness as the source.
Note that in [3]-[5], the authors proved that the node with
the maximum infection closeness is the MLE on regular
trees. For real world networks, our experiments also 
show that the reverse infection algorithm outperforms 
random guesses significantly. We then further evaluate the 
performance of the reverse infection algorithm on several 
real world networks. 
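
The infection closeness centrality heuristic described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names (`bfs_distances`, `infection_closeness`, `closeness_heuristic`) and the adjacency-dictionary graph representation are assumptions made here, and the graph is assumed to be connected and unweighted.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distances from `source` to every reachable node (unweighted BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def infection_closeness(adj, infected):
    """Map each node to 1 / (sum of distances to the infected nodes).

    Assumes a connected graph, so every distance is defined.
    """
    total = {v: 0 for v in adj}
    for u in infected:
        dist = bfs_distances(adj, u)
        for v in adj:
            total[v] += dist[v]
    # A zero sum only occurs when v is the sole infected node.
    return {v: (1.0 / s if s > 0 else float("inf")) for v, s in total.items()}


def closeness_heuristic(adj, infected):
    """Return the node with the maximum infection closeness (ties: smallest label)."""
    scores = infection_closeness(adj, infected)
    return max(sorted(adj), key=lambda v: scores[v])


# Hypothetical example: a path 0-1-2-3-4 with infected nodes 1, 3 and 4.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
estimate = closeness_heuristic(path, [1, 3, 4])
```

In this example node 3 has the smallest distance sum (2 + 0 + 1 = 3), so the heuristic selects it.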

B. Related Work 

There have been extensive studies on the spread of epidemics
in networks based on the SIR model (see [1], [2],
[8], [9] and references therein). The work most related to this
paper is [3]-[5], in which the information source detection
problem was studied under the SI model. The works [10], [11]
consider the problem of detecting multiple information sources
under the SI model. This paper considers the SIR model,
where infected nodes may recover, which can occur in many
practical scenarios as we have explained. Because of node 
recovery, the information source detection problem under the 
SIR model differs significantly from that under the SI model. 
The differences are summarized below. 

• The set of possible sources in the SI model [3]-[5]
is restricted to the set of infected nodes. In the SIR
model, all nodes are possible information sources because
we assume susceptible nodes and recovered nodes are
indistinguishable, so a healthy node may be a recovered
node and hence may be the information source. Therefore, the
number of candidate sources is much larger in the SIR 
model than that in the SI model. 

• A key observation in [3]-[5] is that on regular trees,
all permitted permutations of infection sequences (an infection
sequence specifies the order in which nodes are
infected) are equally likely under the SI model. The
number of possible permutations from a fixed root node,
therefore, decides the likelihood of the root node being
the source. However, under the SIR model, different
infection sequences are associated with different probabilities,
so counting the number of permutations is not
sufficient.

• [3]-[5] proved that the node with the maximum closeness
centrality is an MLE on regular trees. We define the
infection closeness centrality to be the inverse of the 
sum of distances to infected nodes. Our simulations show 
that the sample path based estimator is closer to the 
actual source than the nodes with the maximum infection 
closeness. 

Other related works include: (1) detecting the first adopter of
innovations based on a game theoretical model [12], in which
the authors derived the MLE but the computational complexity
is exponential in the number of nodes, (2) network forensics
under the SI model [13], where the goal is to distinguish an
epidemic infection from a random infection, and (3) geospatial
abduction problems (see [14], [15] and references therein).

II. Problem Formulation 

A. The SIR Model for Information Propagation 

Consider an undirected graph $G = \{\mathcal{V}, \mathcal{E}\}$, where $\mathcal{V}$ is
the set of nodes and $\mathcal{E}$ is the set of (undirected) edges.
Each node $v \in \mathcal{V}$ has three possible states: susceptible ($S$),
infected ($I$), and recovered ($R$). We assume a time slotted
system. Nodes change their states at the beginning of each
time slot, and the state of node $v$ in time slot $t$ is denoted
by $X_v(t)$. Initially, all nodes are in state $S$ except node $v^*$,
which is in state $I$ and is the information source. At the
beginning of each time slot, each infected node infects each
of its susceptible neighbors with probability $q$, independent of
other nodes, i.e., a susceptible node is infected with probability
$1 - (1 - q)^n$ if it has $n$ infected neighbors. Each infected
node recovers with probability $p$, i.e., its state changes from $I$
to $R$ with probability $p$. In addition, we assume a recovered
node cannot be infected again. Since whether a node gets
infected only depends on the states of its neighbors and
whether a node becomes a recovered node only depends on
its own state in the previous time slot, the infection process
can be modeled as a discrete time Markov chain $\mathbf{X}(t)$, where
$\mathbf{X}(t) = \{X_v(t), v \in \mathcal{V}\}$ is the states of all the nodes at time
slot $t$. The initial state of this Markov chain is $X_v(0) = S$ for
$v \neq v^*$ and $X_{v^*}(0) = I$.

Figure 1. An Example of Information Propagation
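
The time-slotted SIR dynamics described above can be sketched as a small simulation. This is only an illustration under stated assumptions: the names `sir_step`, `simulate_sir` and `snapshot`, the adjacency-dictionary graph representation, and the seeded random generator are introduced here and are not from the paper. Updates are synchronous, so a node infected in a slot cannot recover before the next slot, and a node that recovers in a slot still attempts to infect its neighbors in that slot.

```python
import random

S, I, R = "S", "I", "R"


def sir_step(adj, state, q, p, rng):
    """Advance the SIR Markov chain by one time slot (synchronous update)."""
    new_state = dict(state)
    for v, s in state.items():
        if s == S:
            # n infected neighbors => infected with probability 1 - (1 - q)^n.
            n = sum(1 for u in adj[v] if state[u] == I)
            if n > 0 and rng.random() < 1 - (1 - q) ** n:
                new_state[v] = I
        elif s == I:
            # Each infected node recovers with probability p.
            if rng.random() < p:
                new_state[v] = R
        # Recovered nodes never change state again.
    return new_state


def simulate_sir(adj, source, q, p, t, seed=0):
    """Run t time slots starting from a single infected source."""
    rng = random.Random(seed)
    state = {v: S for v in adj}
    state[source] = I
    for _ in range(t):
        state = sir_step(adj, state, q, p, rng)
    return state


def snapshot(state):
    """Observation Y: 1 for infected nodes, 0 for susceptible or recovered."""
    return {v: 1 if s == I else 0 for v, s in state.items()}


# Hypothetical example: a path 0-1-2-3-4 with source 0, q = 1 and p = 0.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
y = snapshot(simulate_sir(path, source=0, q=1.0, p=0.0, t=2))
```

With $q = 1$ and $p = 0$ the process is deterministic, so after two slots exactly the nodes within distance 2 of the source are infected.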

B. Information Source Detection 

We assume $\mathbf{X}(t)$ is not fully observable since we cannot
distinguish susceptible nodes and recovered ones. So at time
$t$, we observe $\mathbf{Y} = \{Y_v, v \in \mathcal{V}\}$ such that

$$Y_v = \begin{cases} 1, & \text{if } v \text{ is in state } I; \\ 0, & \text{if } v \text{ is in state } S \text{ or } R. \end{cases}$$

The information source detection problem is to identify $v^*$
given the graph $G$ and $\mathbf{Y}$, where $t$ is an unknown parameter.

Figure 1 is an example of the infection process. The left
figure shows the information propagation over time. The nodes 
on each dotted line are the nodes which are infected at that 
time slot, and the arrows indicate where the infection comes 
from (e.g., node 4 is infected by node 2). 

The figure on the right is the network we observe, where 
the shaded nodes are infected nodes and others are susceptible 
or recovered nodes. The pair of numbers next to each node 
are the corresponding infection time and recovery time. For 
example, node 3 was infected at time slot 2 and recovered 
at time slot 3. A value of $-1$ indicates that the infection or recovery
has not yet occurred. Note that these two pieces of information
are not available to us, and we include them in the figure to 
illustrate the infection and recovery processes. If we observe 
the network at the end of time slot 3, then the snapshot of 
the network is Y = {0,1,0,1,0,1,1}, where the states are 
ordered according to the indices of the nodes. 

C. Maximum Likelihood Detection 

We define $\mathbf{X}[0,t] = \{\mathbf{X}(\tau) : 0 < \tau \le t\}$ to be a sample
path of the infection process from $0$ to $t$. In addition, we define
function $F(\cdot)$ such that

$$F(X_v(t)) = \begin{cases} 1, & \text{if } X_v(t) = I; \\ 0, & \text{otherwise.} \end{cases}$$

We say $\mathbf{F}(\mathbf{X}(t)) = \mathbf{Y}$ if $F(X_v(t)) = Y_v$ for all $v$. Identifying
the information source can be formulated as a maximum
likelihood detection problem as follows:

$$v^\dagger \in \arg\max_{v \in \mathcal{V}} \sum_{\mathbf{X}[0,t] :\, \mathbf{F}(\mathbf{X}(t)) = \mathbf{Y}} \Pr(\mathbf{X}[0,t] \mid v^* = v),$$

where $\Pr(\mathbf{X}[0,t] \mid v^* = v)$ is the probability to obtain sample
path $\mathbf{X}[0,t]$ given the information source is node $v$.

We note that the difficulty of solving this maximum likelihood
problem is the curse of dimensionality. For each $v$ such that
$Y_v = 0$, we need to decide its infection time and recovery
time (the node is in the susceptible state if the infection time is
$> t$), i.e., $O(t^2)$ possible choices; for each $v$ such that $Y_v = 1$,
we need to decide the infection time, i.e., $O(t)$ possible
choices. Therefore, even for a fixed $t$, the number of possible
sample paths is at least on the order of $t^N$, where $N$ is the
number of nodes in the network. This curse of dimensionality
makes it computationally expensive, if not impossible, to solve 
the maximum likelihood problem. To overcome this difficulty, 
we propose a sample path based approach which is discussed 
below. 

D. Sample Path Based Detection 

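Instead of using the MLE, we propose to identify the sample
path $\mathbf{X}^*[0,t^*]$ that most likely leads to $\mathbf{Y}$, i.e.,

$$\mathbf{X}^*[0,t^*] = \arg\max_{t,\, \mathbf{X}[0,t] \in \mathcal{X}(t)} \Pr(\mathbf{X}[0,t]), \qquad (1)$$

where $\mathcal{X}(t) = \{\mathbf{X}[0,t] \mid \mathbf{F}(\mathbf{X}(t)) = \mathbf{Y}\}$. The source node
associated with $\mathbf{X}^*[0,t^*]$ is then viewed as the information
source.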

III. Sample Path Based Detection On Tree 
Networks 

The optimal sample paths for general graphs are still difficult
to obtain. In this section, we focus on tree networks and
derive structural properties of the optimal sample paths.

First, we introduce the definition of eccentricity in graph
theory [16]. The eccentricity $e(v)$ of a vertex $v$ is the maximum
distance between $v$ and any other vertex in the graph. The Jordan
centers of a graph are the nodes which have the minimum
eccentricity. For example, in Figure 2, the eccentricity of node
$v_1$ is 4 and the Jordan center is $v_2$, whose eccentricity is 3.

Following a similar terminology, we define the infection
eccentricity $e(v)$ given $\mathbf{Y}$ as the maximum distance between
$v$ and any infected node in the graph. Define the Jordan
infection centers of a graph to be the nodes with the minimum
infection eccentricity given $\mathbf{Y}$. In Figure 2, nodes v^,
$v_{10}$, $v_{13}$ and $v_{14}$ are observed to be infected. The infection
eccentricities of $v_1$, $v_2$, $v_3$, $v_4$ are 2, 3, 4, 5, respectively, and
the Jordan infection center is $v_1$.
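
The infection eccentricity and the Jordan infection centers can be computed with one breadth-first search per infected node. The sketch below is an illustration only: the function names and the adjacency-dictionary representation are assumptions made here, the graph is assumed connected, and the example tree is hypothetical (it is not the tree of Figure 2).

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distances from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def infection_eccentricity(adj, infected):
    """Map each node v to the maximum distance from v to any infected node."""
    ecc = {v: 0 for v in adj}
    for u in infected:
        dist = bfs_distances(adj, u)
        for v in adj:
            ecc[v] = max(ecc[v], dist[v])
    return ecc


def jordan_infection_centers(adj, infected):
    """Return the nodes with the minimum infection eccentricity, sorted."""
    ecc = infection_eccentricity(adj, infected)
    best = min(ecc.values())
    return sorted(v for v, e in ecc.items() if e == best)


# Hypothetical tree with edges 0-1, 1-2, 2-3, 1-4, 4-5; nodes 3 and 5 infected.
tree = {0: [1], 1: [0, 2, 4], 2: [1, 3], 3: [2], 4: [1, 5], 5: [4]}
ecc = infection_eccentricity(tree, [3, 5])
centers = jordan_infection_centers(tree, [3, 5])
```

Here node 1 is the unique Jordan infection center with infection eccentricity 2, even though node 1 itself is not observed to be infected, which matches the remark that healthy nodes are valid source candidates under the SIR model.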

We will show that the source associated with the optimal 
sample path is a node with the minimum infection eccentricity. 
We derive this result using three steps: first, assuming the 
information source is $v_r$, we analyze $t^*_{v_r}$ such that

$$t^*_{v_r} = \arg\max_{t,\, \mathbf{X}[0,t] \in \mathcal{X}(t)} \Pr(\mathbf{X}[0,t] \mid v^* = v_r),$$

i.e., $t^*_{v_r}$ is the time duration of the optimal sample path in
which $v_r$ is the information source. It turns out that $t^*_{v_r}$ equals
the infection eccentricity of node $v_r$. Considering Figure 2,
if the source is $v_1$, then the time duration of the optimal
sample path starting from $v_1$ is 2.

Figure 2. An Example Illustrating the Infection Eccentricity

In the second step, we consider two neighboring nodes, say 
nodes $v_1$ and $v_2$. We will prove that if $e(v_1) < e(v_2)$, then
the optimal sample path rooted at $v_1$ occurs with a higher
probability than the optimal sample path rooted at $v_2$.

Finally, at the third step, we will show that given any two 
nodes u and v, if v has the minimum infection eccentricity 
and u has a larger infection eccentricity, then there exists 
a path from u to v along which the infection eccentricity 
monotonically decreases, which implies that the source of 
the optimal sample path must be a Jordan infection center. 
For example, in Figure 2, node $v_4$ has a larger infection
eccentricity than $v_1$, and $v_4 \to v_3 \to v_2 \to v_1$ is the path along
which the infection eccentricity monotonically decreases from 
5 to 2. 

A. The Optimal Time 

Lemma 1. Consider a tree network rooted at $v_r$ and with
infinitely many levels. Assume the information source is the
root, and the observed infection topology is $\mathbf{Y}$ which contains
at least one infected node. If $e(v_r) \le t_1 < t_2$, then the
following inequality holds

$$\max_{\mathbf{X}[0,t_1] \in \mathcal{X}(t_1)} \Pr(\mathbf{X}[0,t_1]) > \max_{\mathbf{X}[0,t_2] \in \mathcal{X}(t_2)} \Pr(\mathbf{X}[0,t_2]),$$

where $\mathcal{X}(t) = \{\mathbf{X}[0,t] \mid \mathbf{F}(\mathbf{X}(t)) = \mathbf{Y}\}$. In addition,

$$t^*_{v_r} = e(v_r) = \max_{u \in \mathcal{I}} d(v_r, u),$$

where $d(v_r, u)$ is the length of the shortest path between $v_r$
and $u$, also called the distance between $v_r$ and $u$, and $\mathcal{I}$
is the set of infected nodes. □

Proof: We start from the case where the time difference
of two sample paths is one, i.e., we will show that

$$\max_{\mathbf{X}[0,t] \in \mathcal{X}(t)} \Pr(\mathbf{X}[0,t]) > \max_{\mathbf{X}[0,t+1] \in \mathcal{X}(t+1)} \Pr(\mathbf{X}[0,t+1]). \qquad (2)$$

We divide all possible infection topologies $\mathbf{Y}$ into countable
subsets $\{\mathcal{Y}^k\}$, where $\mathcal{Y}^k$ is the set of infection topologies where
the largest distance from $v_r$ to an infected node is $k$. $\mathcal{Y}^0$ is the
topology where there is only one infected node, the root node
$v_r$. Note that if no infected node is observed, no algorithm
performs better than a random guess. To prove (2), we use
induction over $k$.

Step 1: First, we consider the case $k = 0$. All the sample
paths considered in step 1 lead to observation $\mathbf{Y} \in \mathcal{Y}^0$. We
denote by $T_{v_r}$ the tree rooted at $v_r$ and by $T_u^{-v_r}$ the tree rooted at
$u$ but without the branch from $v_r$. For example, Figure 3 shows
$T_{u_1}^{-v_r}$, $T_{u_2}^{-v_r}$, $T_{u_3}^{-v_r}$ and $T_{u_4}^{-v_r}$. The sample path from time
slot $0$ to $t$ restricted to $T_u^{-v_r}$ is denoted by $\mathbf{X}([0,t], T_u^{-v_r})$.
Furthermore, denote by $\mathcal{C}(v)$ the set of children of $v$. We have

$$\begin{aligned}
\Pr(\mathbf{X}[0,t]) &= \Pr(X_{v_r}(s) = I,\ 0 < s \le t) \times \prod_{u \in \mathcal{C}(v_r)} \Pr(\mathbf{X}([0,t], T_u^{-v_r}) \mid X_{v_r}(t) = I) \\
&= (1-p)^t \prod_{u \in \mathcal{C}(v_r)} \Pr(\mathbf{X}([0,t], T_u^{-v_r}) \mid X_{v_r}(t) = I),
\end{aligned}$$

where the last equality holds since $v_r$ is the only infected
node in the network at time $t$, which requires $X_{v_r}(s) = I$ for
$0 < s \le t$. Node $u \in \mathcal{C}(v_r)$ has two possible states, $S$ or $R$.

Step 1.a: $u$ is susceptible if it was not infected within $t$ time
slots. In each time slot, $v_r$ tries to infect $u$ with probability $q$.
The probability that $u$ is susceptible at time slot $t$ is

$$(1-q)^t,$$

which implies that

$$\Pr(\mathbf{X}([0,t], T_u^{-v_r}) \mid X_{v_r}(t) = I) = (1-q)^t \qquad (3)$$

if $X_u(t) = S$.

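Step 1.b: If $u$ is in the recovered state, we denote by $t_u^I$
and $t_u^R$ its infection and recovery times, respectively. Then,
we have, if $X_u(t) = R$,

$$\begin{aligned}
&\Pr(\mathbf{X}([0,t], T_u^{-v_r}) \mid X_{v_r}(t) = I) \\
&\quad = (1-q)^{t_u^I - 1} q (1-p)^{t_u^R - t_u^I - 1} p \prod_{w \in \mathcal{C}(u)} \Pr(\mathbf{X}([0,t], T_w^{-u}) \mid t_u^I, t_u^R), \qquad (4)
\end{aligned}$$

where $(1-q)^{t_u^I - 1} q (1-p)^{t_u^R - t_u^I - 1} p$ is the probability that
node $u$ was infected at time $t_u^I$ and recovered at time $t_u^R$.
Since $T_w^{-u}$ is also an infinite tree, there exists at least one
node $\xi \in T_w^{-u}$ such that the node is in the susceptible state
but its parent node (say node $\gamma$) is in the recovered state. We
denote by $T_w^{-u} \setminus T_\xi^{-\gamma}$ the set of nodes that are on subtree $T_w^{-u}$
but not on subtree $T_\xi^{-\gamma}$. Then,

$$\begin{aligned}
\Pr(\mathbf{X}([0,t], T_w^{-u}) \mid t_u^I, t_u^R)
&= \Pr(\mathbf{X}([0,t], T_w^{-u} \setminus T_\xi^{-\gamma}) \mid t_u^I, t_u^R) \qquad (5) \\
&\quad \times \Pr(\mathbf{X}([0,t], T_\xi^{-\gamma}) \mid t_u^I, t_u^R) \qquad (6) \\
&= \Pr(\mathbf{X}([0,t], T_w^{-u} \setminus T_\xi^{-\gamma}) \mid t_u^I, t_u^R)\, (1-q)^{t_\gamma^R - t_\gamma^I} \qquad (7) \\
&\le (1-q), \qquad (8)
\end{aligned}$$

where equation (7) holds because $\xi$ remained susceptible
during the time slots at which $\gamma$ was in the infected state,
and (8) holds because $t_\gamma^R - t_\gamma^I \ge 1$. The maximum value of
$\Pr(\mathbf{X}([0,t], T_w^{-u}) \mid t_u^I, t_u^R)$ can be achieved in the sample path
in which $u$ was infected and then recovered in the next time
slot so that $w$ was vulnerable to infection only in one time
slot. Furthermore,

$$(1-q)^{t_u^I - 1} q (1-p)^{t_u^R - t_u^I - 1} p$$

is maximized when $t_u^R = t_u^I + 1 = 2$, i.e., $u$ was infected at the
first time slot and recovered in the second time slot. Therefore,
if $X_u(t) = R$,

$$\Pr(\mathbf{X}([0,t], T_u^{-v_r}) \mid X_{v_r}(t) = I) \le qp(1-q)^{|\mathcal{C}(u)|}. \qquad (9)$$

Figure 3. Example of Lemma 1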



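Step 1.c: Define $\mathbf{X}^*[0,t]$ to be the optimal solution to

$$\max_{\mathbf{X}[0,t] \in \mathcal{X}(t)} \Pr(\mathbf{X}[0,t]).$$

For $t = 1$, since all $u \in \mathcal{C}(v_r)$ are in the susceptible state,

$$\Pr(\mathbf{X}^*[0,t]) = (1-p)(1-q)^{|\mathcal{C}(v_r)|}. \qquad (10)$$

For $t \ge 2$, according to (3) and (9),

$$\begin{aligned}
&\Pr(\mathbf{X}^*[0,t]) \qquad (11) \\
&\quad = (1-p)^t \prod_{u \in \mathcal{C}(v_r)} \max\left\{(1-q)^t,\ qp(1-q)^{|\mathcal{C}(u)|}\right\}. \qquad (12)
\end{aligned}$$

Note that $t$ is fixed in this optimization problem and (12) is a
non-increasing function of $t$. Since $|\mathcal{C}(u)| \ge 1$,

$$\begin{aligned}
\Pr(\mathbf{X}^*[0,2]) &\le (1-p)^2 \max\left\{(1-q)^{2|\mathcal{C}(v_r)|},\ (qp(1-q))^{|\mathcal{C}(v_r)|}\right\} \\
&< (1-p)(1-q)^{|\mathcal{C}(v_r)|} \\
&= \Pr(\mathbf{X}^*[0,1]).
\end{aligned}$$

In summary, $\Pr(\mathbf{X}^*[0,t])$ is a non-increasing function of
$t \in [1, \infty)$ when $k = 0$.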
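
The monotonicity claimed in Step 1.c can be checked numerically. The sketch below simply evaluates (10) for $t = 1$ and the right-hand side of (12) for $t \ge 2$; it is not part of the proof. The function name `optimal_path_prob_k0` and the list `child_counts` (holding $|\mathcal{C}(u)|$ for each child $u$ of the root) are hypothetical names introduced for this illustration.

```python
def optimal_path_prob_k0(t, p, q, child_counts):
    """Evaluate (10) when t == 1 and the right-hand side of (12) when t >= 2.

    child_counts[i] is |C(u_i)|, the number of children of the i-th child
    of the root; every entry is assumed to be at least 1 (infinite tree).
    """
    if t < 1:
        raise ValueError("t must be a positive integer")
    if t == 1:
        # All children of the root are still susceptible.
        return (1 - p) * (1 - q) ** len(child_counts)
    prob = (1 - p) ** t
    for c in child_counts:
        # Each child is either never infected, or infected and recovered
        # as quickly as possible.
        prob *= max((1 - q) ** t, q * p * (1 - q) ** c)
    return prob


# Hypothetical parameters: p = 0.3, q = 0.4, root with two children
# that have 2 and 3 children respectively.
values = [optimal_path_prob_k0(t, 0.3, 0.4, [2, 3]) for t in range(1, 8)]
is_non_increasing = all(a >= b for a, b in zip(values, values[1:]))
```

For these parameters the sequence is non-increasing in $t$, in line with the conclusion of Step 1.c.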

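Step 2: Assume (2) holds for $k \le n$, and consider $k = n+1$.
Clearly $t \ge n + 1 > 1$ for each $\mathbf{X}[0,t]$ such that

$$\mathbf{F}(\mathbf{X}(t)) \in \mathcal{Y}^{n+1}.$$

Furthermore, the set of subtrees $\mathcal{T} = \{T_u^{-v_r} \mid u \in \mathcal{C}(v_r)\}$ is
divided into two subsets: $\mathcal{T}^h = \{T_u^{-v_r} \mid u \in \mathcal{C}(v_r),\ \mathbf{Y}(T_u^{-v_r}) \cap \mathcal{I} = \emptyset\}$
and $\mathcal{T}^i = \mathcal{T} \setminus \mathcal{T}^h$, where $\mathbf{Y}(T_u^{-v_r})$ is the vector of
$\mathbf{Y}$ restricted to subtree $T_u^{-v_r}$. In Figure 3, $\mathcal{T}^h = \{T_{u_4}^{-v_r}\}$
and $\mathcal{T}^i = \{T_{u_1}^{-v_r}, T_{u_2}^{-v_r}, T_{u_3}^{-v_r}\}$. We note that given $t_{v_r}^R$, the
infection processes on the subtrees are mutually independent.

Step 2.a: Recall that $\mathcal{T}^h$ is the set of subtrees having no
infected nodes. Following the argument for the $k = 0$ case,
we can obtain that if $T_u^{-v_r} \in \mathcal{T}^h$, then

$$\Pr(\mathbf{X}^*([0,t], T_u^{-v_r}) \mid t_{v_r}^R) = \max\left\{(1-q)^{t_{v_r}^R},\ qp(1-q)^{|\mathcal{C}(u)|}\right\}$$

when $t > t_{v_r}^R$, and

$$\Pr(\mathbf{X}^*([0,t], T_u^{-v_r}) \mid t_{v_r}^R) = \max\left\{(1-q)^t,\ qp(1-q)^{|\mathcal{C}(u)|}\right\}$$

when $t \le t_{v_r}^R$. So $\Pr(\mathbf{X}^*([0,t], T_u^{-v_r}) \mid t_{v_r}^R)$ is non-increasing in
$t$ given any $t_{v_r}^R$.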

Step 2.b: For $T_u^{-v_r} \in \mathcal{T}^i$, given the sample path
$\mathbf{X}([0,t+1], T_u^{-v_r})$, we will construct a sample path $\tilde{\mathbf{X}}([0,t], T_u^{-v_r})$
which occurs with a higher probability. Denote the infection
time of $u$ in sample path $\tilde{\mathbf{X}}([0,t], T_u^{-v_r})$ by $\tilde{t}_u^I$. We let $t_u^I$
denote the infection time in sample path $\mathbf{X}([0,t+1], T_u^{-v_r})$.

If $t_u^I > 1$, we choose $\tilde{t}_u^I = t_u^I - 1$, i.e., $u$ is infected one
time slot later in $\mathbf{X}[0,t+1]$ than in $\tilde{\mathbf{X}}[0,t]$. Assume
the infection processes after $u$ was infected are the same in
the two sample paths $\tilde{\mathbf{X}}([0,t], T_u^{-v_r})$ and $\mathbf{X}([0,t+1], T_u^{-v_r})$.
Therefore, we have

$$\Pr(\mathbf{X}([0,t+1], T_u^{-v_r})) = (1-q)^{t_u^I - 1} q \Pr(\mathbf{X}([0,t+1], T_u^{-v_r}) \mid t_u^I),$$

and

$$\Pr(\tilde{\mathbf{X}}([0,t], T_u^{-v_r})) = (1-q)^{\tilde{t}_u^I - 1} q \Pr(\tilde{\mathbf{X}}([0,t], T_u^{-v_r}) \mid \tilde{t}_u^I),$$

where $\Pr(\tilde{\mathbf{X}}([0,t], T_u^{-v_r}) \mid \tilde{t}_u^I)$ is the probability of
$\tilde{\mathbf{X}}([0,t], T_u^{-v_r})$ after $u$ was infected. Since the sample
paths $\tilde{\mathbf{X}}([0,t], T_u^{-v_r})$ and $\mathbf{X}([0,t+1], T_u^{-v_r})$ are the same
after $u$ was infected, we obtain

$$\Pr(\tilde{\mathbf{X}}([0,t], T_u^{-v_r}) \mid \tilde{t}_u^I) = \Pr(\mathbf{X}([0,t+1], T_u^{-v_r}) \mid t_u^I).$$

Therefore, with $\tilde{t}_u^I = t_u^I - 1$, we get

$$\Pr(\mathbf{X}([0,t+1], T_u^{-v_r})) < \Pr(\tilde{\mathbf{X}}([0,t], T_u^{-v_r})).$$

If $t_u^I = 1$, we set $\tilde{t}_u^I = t_u^I = 1$.¹ Based on the induction
assumption for $k \le n$, since $\mathbf{Y}(T_u^{-v_r}) \in \mathcal{Y}^m$ with $m \le n$, we
have

$$\max_{\tilde{\mathbf{X}}([0,t], T_u^{-v_r}) \in \mathcal{X}(t, T_u^{-v_r})} \Pr(\tilde{\mathbf{X}}([0,t], T_u^{-v_r})) > \max_{\mathbf{X}([0,t+1], T_u^{-v_r}) \in \mathcal{X}(t+1, T_u^{-v_r})} \Pr(\mathbf{X}([0,t+1], T_u^{-v_r})),$$

where $\mathcal{X}(t, T_u^{-v_r}) = \{\mathbf{X}([0,t], T_u^{-v_r}) : \mathbf{F}(\mathbf{X}([0,t], T_u^{-v_r})) = \mathbf{Y}(T_u^{-v_r})\}$.
Therefore, given any $\mathbf{X}([0,t+1], T_u^{-v_r})$, we
can always find a corresponding sample path $\tilde{\mathbf{X}}([0,t], T_u^{-v_r})$,
which occurs with a higher probability.

¹ Note that we cannot apply the same argument to $t_u^I > 1$ because $\tilde{t}_u^I = t_u^I$
may not be feasible in a valid $\tilde{\mathbf{X}}[0,t]$.

Step 2.c: Now we consider the sample path $\mathbf{X}^*[0,t+1]$ and
denote by $t_{v_r}^R$ the recovery time of node $v_r$ in $\mathbf{X}^*[0,t+1]$.
We now construct a sample path $\tilde{\mathbf{X}}[0,t]$ as follows:

• If $t_{v_r}^R > t+1$, i.e., $v_r$ is an infected node, then $\tilde{t}_{v_r}^R > t$,
where $\tilde{t}_{v_r}^R$ is the recovery time of $v_r$ in $\tilde{\mathbf{X}}[0,t]$.

• If $t_{v_r}^R \le t$, we choose $\tilde{t}_{v_r}^R = t_{v_r}^R$.

• If $t_{v_r}^R = t+1$, we choose $\tilde{t}_{v_r}^R = t$.

We further complete $\tilde{\mathbf{X}}[0,t]$ by having optimal ones on $\mathcal{T}^h$
and constructing the ones in $\mathcal{T}^i$ following step 2.b. According
to steps 2.a and 2.b, it is easy to verify that $\tilde{\mathbf{X}}[0,t]$ occurs with
a higher probability than $\mathbf{X}^*[0,t+1]$. Therefore, we conclude
that inequality (2) holds for $k = n+1$, hence for any $k$
according to the principle of induction.

Step 3: Repeatedly applying inequality (2), we obtain that
$t^*_{v_r}$ is the minimum amount of time required to produce the
observed infection topology. The minimum time required is
equal to the maximum distance from $v_r$ to an infected node.
Therefore, the lemma holds. ■

B. The Sample Path Based Estimator 

After deriving $t^*_{v_r}$, we have a unique $t^*_v$ for each $v \in \mathcal{V}$. The
next lemma states that the optimal sample path starting from
a node with a smaller infection eccentricity is more likely to
occur.

Lemma 2. Consider a tree network with infinitely many levels.
Assume the information source is the root, and the observed
infection topology is $\mathbf{Y}$ which contains at least one infected
node. For $u, v \in \mathcal{V}$ such that $(u, v) \in \mathcal{E}$, if $t^*_u > t^*_v$, then

$$\Pr(\mathbf{X}^*_u[0,t^*_u]) < \Pr(\mathbf{X}^*_v[0,t^*_v]),$$

where $\mathbf{X}^*_u[0,t^*_u]$ is the optimal sample path starting from node
$u$.

Proof: Recall that $T_v$ denotes the tree rooted at $v$ and
$T_u^{-v}$ denotes the tree rooted at $u$ but without the branch
from $v$. Furthermore, $\mathcal{C}(v)$ is the set of children of $v$, and
$\mathbf{X}([0,t], T_u^{-v})$ is the sample path $\mathbf{X}[0,t]$ restricted to $T_u^{-v}$.

Step 1: The first step is to show $t^*_u = t^*_v + 1$. First we claim
$T_v^{-u} \cap \mathcal{I} \neq \emptyset$. Otherwise, all infected nodes are on $T_u^{-v}$. Since
on a tree, $v$ can only reach nodes in $T_u^{-v}$ through edge $(u, v)$,
$t^*_v = t^*_u + 1$, which contradicts $t^*_u > t^*_v$.

If $T_u^{-v} \cap \mathcal{I} \neq \emptyset$, then $\forall a \in T_u^{-v} \cap \mathcal{I}$, we have

$$d(u, a) = d(v, a) - 1 \le t^*_v - 1,$$

and $\forall b \in T_v^{-u} \cap \mathcal{I}$,

$$d(u, b) = d(v, b) + 1 \le t^*_v + 1.$$

Hence,

$$t^*_u \le t^*_v + 1,$$

which implies that

$$t^*_v < t^*_u \le t^*_v + 1,$$

i.e., $t^*_u = t^*_v + 1$.

If $T_u^{-v} \cap \mathcal{I} = \emptyset$, all infected nodes are in $T_v^{-u}$, so it is
obvious that $t^*_u = t^*_v + 1$.

Step 2: In this step, we will prove that $t_v^I = 1$ on the sample
path $\mathbf{X}^*_u[0,t^*_u]$. If $t_v^I > 1$ on $\mathbf{X}^*_u[0,t^*_u]$, then

$$t^*_u - t_v^I = t^*_v + 1 - t_v^I < t^*_v.$$

Note that according to the definition of $t^*_u$ and $t_v^I$, within $t^*_u - t_v^I$
time slots, node $v$ can infect all infected nodes on $T_v^{-u}$. Since
$t^*_u = t^*_v + 1$, the infected node farthest from node $u$ must be
on $T_v^{-u}$, which implies that there exists a node $a \in T_v^{-u}$ such
that $d(u, a) = t^*_u = t^*_v + 1$ and $d(v, a) = t^*_v$. So node $v$ cannot
reach $a$ within $t^*_u - t_v^I$ time slots, which contradicts the fact
that the infection can spread from node $v$ to $a$ within $t^*_u - t_v^I$
time slots along the sample path $\mathbf{X}^*_u[0,t^*_u]$. Therefore, $t_v^I = 1$.

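Step 3: Now given sample path $\mathbf{X}^*_u[0,t^*_u]$, we construct
$\tilde{\mathbf{X}}_v[0,t^*_v]$ which occurs with a higher probability. We divide
the sample path $\mathbf{X}^*_u[0,t^*_u]$ into two parts along subtrees $T_u^{-v}$
and $T_v^{-u}$. Since $t_v^I = 1$, we have

$$\Pr(\mathbf{X}^*_u[0,t^*_u]) = q \Pr(\mathbf{X}^*_u([0,t^*_u], T_v^{-u}) \mid t_v^I = 1) \Pr(\mathbf{X}^*_u([0,t^*_u], T_u^{-v})),$$

where $q$ is the probability that $v$ is infected at the first time
slot. Suppose in $\tilde{\mathbf{X}}_v[0,t^*_v]$, node $u$ was infected at the first
time slot, then

$$\Pr(\tilde{\mathbf{X}}_v[0,t^*_v]) = q \Pr(\tilde{\mathbf{X}}_v([0,t^*_v], T_v^{-u})) \Pr(\tilde{\mathbf{X}}_v([0,t^*_v], T_u^{-v}) \mid \tilde{t}_u^I = 1).$$

For the subtree $T_v^{-u}$, given $\mathbf{X}^*_u([0,t^*_u], T_v^{-u})$, in which
$t_v^I = 1$, we construct the partial sample path $\tilde{\mathbf{X}}_v([0,t^*_v], T_v^{-u})$
to be identical to $\mathbf{X}^*_u([0,t^*_u], T_v^{-u})$ except that all events occur
one time slot earlier, i.e.,

$$\tilde{\mathbf{X}}_v([0,t^*_v], T_v^{-u}) = \mathbf{X}^*_u([1,t^*_u], T_v^{-u}).$$

This is feasible because $t^*_v = t^*_u - 1$. Then

$$\Pr(\mathbf{X}^*_u([0,t^*_u], T_v^{-u}) \mid t_v^I = 1) = \Pr(\tilde{\mathbf{X}}_v([0,t^*_v], T_v^{-u})).$$

For the subtree $T_u^{-v}$, we construct $\tilde{\mathbf{X}}_v([0,t^*_v], T_u^{-v})$ such
that

$$\tilde{\mathbf{X}}_v([0,t^*_v], T_u^{-v}) \in \arg\max_{\mathbf{X}([0,t^*_v], T_u^{-v}) \in \mathcal{X}(t^*_v, T_u^{-v})} \Pr(\mathbf{X}([0,t^*_v], T_u^{-v}) \mid \tilde{t}_u^I = 1).$$

Based on Lemma 1, we have

$$\begin{aligned}
&\max_{\mathbf{X}([0,t^*_v], T_u^{-v}) \in \mathcal{X}(t^*_v, T_u^{-v})} \Pr(\mathbf{X}([0,t^*_v], T_u^{-v}) \mid \tilde{t}_u^I = 1) \\
&\quad = \max_{\mathbf{X}([0,t^*_v - 1], T_u^{-v}) \in \mathcal{X}(t^*_v - 1, T_u^{-v})} \Pr(\mathbf{X}([0,t^*_v - 1], T_u^{-v})) \\
&\quad > \max_{\mathbf{X}([0,t^*_u], T_u^{-v}) \in \mathcal{X}(t^*_u, T_u^{-v})} \Pr(\mathbf{X}([0,t^*_u], T_u^{-v})).
\end{aligned}$$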



Therefore, given the optimal sample path rooted at u, we have 
constructed a sample path rooted at v which occurs with a 
higher probability. The lemma holds. ■ 
Next, we give a useful property of the Jordan infection 
centers in the following lemma. 

Lemma 3. On a tree network with at least one infected node, 
there exist at most two Jordan infection centers. When the 
network has two Jordan infection centers, the two must be 
neighbors. □ 



Proof: First, we claim that if there is more than one Jordan
infection center, they must be adjacent. Suppose $v, u \in \mathcal{V}$ are
two Jordan infection centers and $e(v) = e(u) = \lambda$. Suppose
$v$ and $u$ are not adjacent, i.e., $d(v, u) > 1$. Then, there exists
$w \in \mathcal{V}$ such that

$$d(w, u) = 1,$$

and

$$d(w, v) = d(v, u) - 1,$$

i.e., $w$ is a neighbor of $u$ and is on the shortest path between
$u$ and $v$. Note that in a tree structure $w$ is unique.

If $\mathcal{I} \cap T_u^{-w} = \emptyset$, then $\forall a \in \mathcal{I}$,

$$d(w, a) = d(u, a) - 1 < d(u, a),$$

which contradicts the fact that $u$ is a Jordan infection center.

If $\mathcal{I} \cap T_u^{-w} \neq \emptyset$, then $\forall b \in \mathcal{I} \cap T_u^{-w}$,

$$d(v, b) = d(v, w) + d(w, b),$$

i.e.,

$$d(w, b) = d(v, b) - d(v, w) \le \lambda - 1.$$

On the other hand, since $e(u) = \lambda$, $\forall h \in T_w^{-u} \cap \mathcal{I}$,

$$d(w, h) = d(u, h) - 1 \le \lambda - 1.$$

In summary, $\forall h \in \mathcal{I}$,

$$d(w, h) \le \lambda - 1,$$

which contradicts the fact that the minimum infection eccentricity
is $\lambda$.

Therefore all Jordan infection centers must be adjacent to
each other. However, if there existed $n$ Jordan infection
centers with $n > 2$, they would form a clique with
$n$ nodes, which contradicts the fact that the graph is a tree.
Therefore, there exist at most two adjacent Jordan infection
centers. ■
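
Lemma 3 can be sanity-checked by brute force on small random trees. The sketch below is an empirical check under stated assumptions, not a substitute for the proof: it uses finite random trees rather than the infinite trees of the analysis, and the helper names (`random_tree`, `check_lemma3`, and the rest) are hypothetical.

```python
import random
from collections import deque


def bfs_distances(adj, source):
    """Hop distances from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def jordan_infection_centers(adj, infected):
    """Nodes minimizing the maximum distance to the infected nodes."""
    ecc = {v: 0 for v in adj}
    for u in infected:
        dist = bfs_distances(adj, u)
        for v in adj:
            ecc[v] = max(ecc[v], dist[v])
    best = min(ecc.values())
    return sorted(v for v, e in ecc.items() if e == best)


def random_tree(n, rng):
    """Random tree on nodes 0..n-1: node v attaches to a random earlier node."""
    adj = {v: [] for v in range(n)}
    for v in range(1, n):
        parent = rng.randrange(v)
        adj[v].append(parent)
        adj[parent].append(v)
    return adj


def check_lemma3(trials=200, n=30, seed=1):
    """True if every trial has at most two centers, adjacent when there are two."""
    rng = random.Random(seed)
    for _ in range(trials):
        adj = random_tree(n, rng)
        infected = rng.sample(range(n), rng.randint(1, n))
        centers = jordan_infection_centers(adj, infected)
        if len(centers) > 2:
            return False
        if len(centers) == 2 and centers[1] not in adj[centers[0]]:
            return False
    return True


lemma3_holds = check_lemma3()
```

A passing run only indicates that no counterexample was found among the sampled trees and infection sets.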

Based on Lemma 2 and Lemma 3, we finish this section
with the following theorem.

Theorem 4. Consider a tree network with infinitely many
levels. Assume that the observed infection topology $\mathbf{Y}$ contains
at least one infected node. Then the source node associated
with $\mathbf{X}^*[0,t^*]$ (the solution to the optimization problem (1))
is a Jordan infection center, i.e.,

$$v^\dagger = \arg\min_{v \in \mathcal{V}} e(v).$$

Proof: We assume the network has two Jordan infection 
centers: w and u, and assume e(w) — e(u) = A. The same 
argument works for the case where the network has only one 
Jordan infection center. 

Based on Lemma [3] w and u must be adjacent. We will 
show for any a e V\{w, u}, there exists a path from a to it 
(or w) along which the infection eccentricity strictly decreases. 

Step 1: First, it is easy to see from Figure |4]that d(-y, w) < 
A — 1 V7 G T~ u n I. We next show that there exists a node 
£ such that the equality holds. 



(a) 



Xi/^f@>-"""A - 1 



Figure 4. A Pictorial Description of the Positions of Nodes a, u, w and £. 



Suppose that ^(7, w) < A — 2 for any 7 e T w u n I, which 
implies 

d(% u) < A - 1 V 7 G T~ u n 1. 
Since w and u are both Jordan infection centers, we have 



In a summary, V7 € I, 



d(7, w) < A 
d(7,ti) < A - 1. 

T 

d(j,u) < A - 1. 



This contradicts the fact that e(w) = e(w) = A. Therefore, 
there exists £ g n T such that 

d(£,w) = A - 1. 

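Step 2: Similarly, $\forall \gamma \in T_u^{-w} \cap \mathcal{I}$,

$$d(\gamma, u) \le \lambda - 1,$$

and there exists a node such that the equality holds.

Step 3: Next we consider $\alpha \in \mathcal{V} \setminus \{w, u\}$, and assume
$\alpha \in T_u^{-w}$ and $d(\alpha, u) = \beta$. Then for any $\gamma \in T_w^{-u} \cap \mathcal{I}$, we have

$$d(\alpha, \gamma) = d(\alpha, u) + d(u, w) + d(w, \gamma) \le \beta + 1 + \lambda - 1 = \lambda + \beta,$$

and there exists $\xi \in T_w^{-u} \cap \mathcal{I}$ such that the equality holds. On
the other hand, $\forall \gamma \in T_u^{-w} \cap \mathcal{I}$,

$$d(\alpha, \gamma) \le d(\alpha, u) + d(u, \gamma) \le \beta + \lambda - 1.$$

Therefore, we conclude that

$$e(\alpha) = \lambda + \beta,$$

so the infection eccentricity decreases along the path from $\alpha$
to $u$.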

Step 4: Repeatedly applying Lemma 2 along the path from
node $\alpha$ to $u$, we can conclude that the optimal sample path
rooted at node $u$ is more likely to occur than the optimal
sample path rooted at node $\alpha$. Therefore, the root node
associated with the optimal sample path $X^*[0, t^*]$ must be
a Jordan infection center, and the theorem holds. ■
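Theorem 4 reduces source estimation on infinite trees to finding a node of minimum infection eccentricity. The following minimal sketch computes infection eccentricities by running a breadth-first search from every infected node and then returns the Jordan infection centers. It is an illustration only: the adjacency-dictionary representation, the function names, and the seven-node example tree are assumptions introduced here, and the graph is assumed to be connected.

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop distances from `source` to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist

def infection_eccentricity(adj, infected):
    """e(v): the maximum distance from v to any infected node (connected graph assumed)."""
    ecc = {v: 0 for v in adj}
    for i in infected:
        dist = bfs_distances(adj, i)
        for v in adj:
            ecc[v] = max(ecc[v], dist[v])
    return ecc

def jordan_infection_centers(adj, infected):
    """Nodes with minimum infection eccentricity, plus that minimum value."""
    ecc = infection_eccentricity(adj, infected)
    best = min(ecc.values())
    return sorted(v for v, e in ecc.items() if e == best), best

# Hypothetical example tree: the path 0-1-2-3-4-5 with an extra leaf 6 attached to node 2.
adj = {0: [1], 1: [0, 2], 2: [1, 3, 6], 3: [2, 4], 4: [3, 5], 5: [4], 6: [2]}
infected = {0, 5, 6}
centers, min_ecc = jordan_infection_centers(adj, infected)
print(centers, min_ecc)
```

In this example the two centers returned are neighbors, which is the situation covered by Lemma 3.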



IV. Reverse Infection Algorithm 



Since in tree networks with infinitely many levels the
estimator based on the sample path approach is a Jordan
infection center, we view the Jordan infection centers as
candidates for the information source. We next present
a simple algorithm to find the information source in general
networks. The algorithm first identifies the Jordan infection
centers and then breaks ties based on the sum of distances to
infected nodes.

The key idea of the algorithm is to let every infected
node broadcast a message containing its identity (ID) to
its neighbors. Each node, after receiving messages from its
neighbors, checks whether the ID in each message has been
received before. If not, the node records the ID (say $v$) and
the time at which the message is received (say $t_v$), and then
broadcasts the ID to its neighbors. When a node has received
the IDs of all infected nodes, it claims itself as the
information source and the algorithm terminates. If multiple
nodes receive all IDs at the same time, the tie is broken by
selecting the node with the smallest $\sum_v t_v$.

The tie-breaking rule we propose is to choose the node
with the maximum infection closeness. The closeness
measures the efficiency of a node in spreading information to
all other nodes. The closeness of a node is the inverse of the
sum of distances from the node to all other nodes. In our
model, we define the infection closeness as the inverse of the
sum of distances from a node to all infected nodes, which
reflects the efficiency of spreading information to the infected
nodes. We select a Jordan infection center with the largest
infection closeness, breaking ties at random.



Algorithm 1 Reverse Infection Algorithm
for $i \in \mathcal{I}$ do
    $i$ sends its ID $w_i$ to its neighbors.
end for
while $t \ge 1$ and STOP == 0 do
    for $u \in \mathcal{V}$ do
        if $u$ receives $w_i$ for the first time then
            Set $t_{ui} = t$ and then broadcast the message $w_i$ to its
            neighbors.
            If there exists a node who received $|\mathcal{I}|$ distinct
            messages, then set STOP = 1.
        end if
    end for
end while
return $v^{\dagger} = \arg\min_{u \in S} \sum_{i \in \mathcal{I}} t_{ui}$, where $S$ is the set of
nodes who receive $|\mathcal{I}|$ distinct messages when the algorithm
terminates. Ties are broken at random.



It is easy to verify that the set $S$ is the set of the Jordan
infection centers. The running time of the algorithm is equal
to the minimum infection eccentricity, and the number of
messages each node receives/sends during each time slot is
bounded by its degree.



V. Performance Analysis
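As a concrete companion to Algorithm 1, the following minimal sketch simulates the reverse infection algorithm with synchronous time slots. It is an illustration under stated assumptions rather than the authors' implementation: the graph is an undirected adjacency dictionary, every infected node is treated as holding its own ID at time 0, and the function name `reverse_infection` and the example graph are hypothetical.

```python
import random

def reverse_infection(adj, infected, rng=None):
    """Return (estimate, candidate_set, slots_used) for an undirected graph."""
    rng = rng or random.Random(0)
    infected = set(infected)
    if not infected:
        raise ValueError("need at least one infected node")
    # arrival[u][i] = time slot at which u first receives the ID of infected node i
    arrival = {u: {} for u in adj}
    for i in infected:
        arrival[i][i] = 0  # assumption: a node holds its own ID at time 0
    frontier = [(i, i) for i in sorted(infected)]  # (holder, ID) pairs to forward
    t = 0

    def complete():
        return [u for u in adj if len(arrival[u]) == len(infected)]

    winners = complete()
    while not winners:
        if not frontier:
            raise ValueError("no node can reach every infected node")
        t += 1
        next_frontier = []
        for holder, ident in frontier:
            for nbr in adj[holder]:
                if ident not in arrival[nbr]:      # first time nbr sees this ID
                    arrival[nbr][ident] = t
                    next_frontier.append((nbr, ident))
        frontier = next_frontier
        winners = complete()

    # Tie-break: smallest sum of arrival times, i.e. largest infection closeness.
    totals = {u: sum(arrival[u].values()) for u in winners}
    best = min(totals.values())
    tied = sorted(u for u in winners if totals[u] == best)
    return rng.choice(tied), sorted(winners), t

# Hypothetical example tree: the path 0-1-2-3-4-5 with an extra leaf 6 attached to node 2.
adj = {0: [1], 1: [0, 2], 2: [1, 3, 6], 3: [2, 4], 4: [3, 5], 5: [4], 6: [2]}
estimate, candidates, slots = reverse_infection(adj, {0, 5, 6})
print(estimate, candidates, slots)
```

Because each ID travels one hop per slot, the arrival time of an ID equals the hop distance to the node that issued it, so the nodes that first collect all IDs are those of minimum infection eccentricity.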



The reverse infection algorithm is based on the structural
properties of the optimal sample paths on trees. While the
MLE is the node that maximizes the likelihood of the snapshot
among all possible nodes, the sample path based estimator
does not have such a guarantee. To demonstrate the
effectiveness of the sample path based approach, we next show
that on $(g+1)$-regular trees, where each node has $g + 1$
neighbors, the information source generated by the reverse
infection algorithm is within a constant distance from the
actual source with a high probability, independent of the
number of infected nodes and the time at which the snapshot
Y was taken.

Theorem 5. Consider a $(g+1)$-regular tree with infinitely
many levels where $g > 2$ and $gq > 1$. Assume that the
observed infection topology Y contains at least one infected
node. Given $\epsilon > 0$, there exists $d_\epsilon$ such that the distance
between the optimal sample path estimator and the actual
source is at most $d_\epsilon$ with probability at least $1 - \epsilon$, where
$d_\epsilon$ is independent of the number of infected nodes and the
time the snapshot Y was taken.

Proof: Consider the tree rooted at the information source
$v^*$. We say $v^*$ is at level 0. We denote by $\mathcal{Z}_l$ the set of
infected and recovered nodes at level $l$. Furthermore, we
define $\mathcal{Z}_l^\tau$ to be the set of infected and recovered nodes at
level $l$ whose parents are in set $\mathcal{Z}_{l-1}^\tau$ and who were infected
within $\tau$ time slots after their parents were infected. We
assume $\mathcal{Z}_0^\tau = \{v^*\}$. In addition, let $Z_l = |\mathcal{Z}_l|$ and
$Z_l^\tau = |\mathcal{Z}_l^\tau|$. Note

$$\lim_{\tau \to \infty} Z_l^\tau = Z_l,$$

and given $v$ and $u \in \mathcal{Z}_l^\tau$,

$$|t_v^I - t_u^I| \le l(\tau - 1),$$

i.e., the infection times of nodes in $\mathcal{Z}_l^\tau$ differ by at most
$l(\tau - 1)$ (note that the difference is not $\tau - 1$ since the
parents of $u$ and $v$ may be infected at different times). Our
proof is based on the Galton-Watson (GW) branching process
[18]. A GW branching process is a stochastic process $B(l)$
which evolves according to the recurrence formula $B(0) = 1$ and

$$B(l) = \sum_{i=1}^{B(l-1)} \zeta_i,$$

where $\{\zeta_i\}$ is a set of random variables taking values in the
nonnegative integers. The distribution of $\zeta_i$ is called the
offspring distribution of the branching process. In a
$(g+1)$-regular tree, the evolution of $Z_l^\tau$ is a branching
process, where the offspring distribution is a function of $\tau$.
We use $B^\tau$ to denote the corresponding branching process, and
$B^\tau(l)$ to denote the number of offsprings at level $l$, i.e.,
$B^\tau(l) = Z_l^\tau$ (we use these two notations interchangeably).
Given that a node is in the infected state for $t$ time slots, the
number of infected offsprings follows a binomial distribution.
Note the following two facts:




• The number of time slots during which a node is in the
  infected state follows a geometric distribution with
  parameter $p$.

• A child remains susceptible with probability $(1 - q)^t$
  when the parent has been in the infected state for $t$ time
  slots.

Figure 5. A pictorial description of the positions of $v'$, $v$, $u_1$, and $w$.

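Therefore, the offspring distribution of the branching process
$B^\tau$ at level $\ge 1$ (see footnote 2) is

$$\Pr(\gamma = i) = \sum_{t=1}^{\tau-1} p(1-p)^{t-1} \binom{g}{i} \left(1 - (1-q)^t\right)^i (1-q)^{t(g-i)} + (1-p)^{\tau-1} \binom{g}{i} \left(1 - (1-q)^\tau\right)^i (1-q)^{\tau(g-i)},$$

where $\gamma$ is the number of offsprings of a node. The offspring
distribution of the branching process $B^\infty$ is

$$\Pr(\gamma' = i) = \sum_{t=1}^{\infty} p(1-p)^{t-1} \binom{g}{i} \left(1 - (1-q)^t\right)^i (1-q)^{t(g-i)}.$$

Each infected node can be viewed as the source of a branching
process on the subtree rooted at the node. We define $K_l^\tau$
to be the number of surviving $B^1$ branching processes whose
roots are in set $\mathcal{Z}_l^\tau$, where a branching process survives if it
never dies out.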
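The offspring distribution can be evaluated numerically. The sketch below assumes the mixture form implied by the two facts above: the parent's infectious period is geometric with parameter $p$, truncated at $\tau$ slots, and each of the $g$ children is infected independently with probability $1 - (1-q)^t$ over $t$ slots. The function names are hypothetical, and the code is meant as a sanity check rather than as the authors' computation.

```python
from math import comb

def offspring_pmf(g, p, q, tau):
    """Return [Pr(offspring = i) for i in 0..g] for the tau-truncated process."""
    pmf = [0.0] * (g + 1)
    for t in range(1, tau + 1):
        # Weight of an infectious period of t slots; the last term collects
        # every period that lasts tau slots or longer (truncation at tau).
        weight = p * (1 - p) ** (t - 1) if t < tau else (1 - p) ** (tau - 1)
        hit = 1 - (1 - q) ** t  # a given child is infected within t slots
        for i in range(g + 1):
            pmf[i] += weight * comb(g, i) * hit ** i * (1 - hit) ** (g - i)
    return pmf

def mean_offspring(g, p, q, tau):
    """Expected number of infected children under the same assumptions."""
    return sum(i * pr for i, pr in enumerate(offspring_pmf(g, p, q, tau)))

# With tau = 1 the distribution reduces to Binomial(g, q), the B^1 process.
print(offspring_pmf(3, 0.2, 0.5, 1))
print(mean_offspring(3, 0.2, 0.5, 5))
```

For $\tau = 1$ the mean is $gq$, so the condition $gq > 1$ makes the $B^1$ process supercritical; larger $\tau$ can only increase the mean.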

Now given $L > 2$, we consider the following events:

• Event 1: $Z_L = 0$.

• Event 2: $K_l^\tau \ge 2$ for some $l \le L$. In other words, at least
  two $B^1$ branching processes starting from $\mathcal{Z}_l^\tau$ survive for
  some $l \le L$.

We note that these two events are disjoint.

When $Z_L = 0$, no node at level $L$ is infected and the
infection process terminates at level $L - 1$. When there is
at least one infected node in Y, since $e(v^*) \le L - 1$, the
minimum infection eccentricity is at most $L - 1$. Therefore,
the distance between $v^*$ and $v^{\dagger}$ is no more than $2(L - 1)$.

2 The source node has g + 1 children, while every other node has g children.



Given $K_l^\tau \ge 2$ for some $l \le L$, we will argue that the
distance between the sample path based estimator and the
actual source is upper bounded by $(\tau + 1)L - 1$. Consider
Figure 5, where the shaded nodes are infected and recovered
nodes. We will show that if two $B^1$ branching processes
starting from level $l \le L$ survive, a node at level
$\ge (\tau + 1)L - 1$ cannot be a Jordan infection center. Recall
that at time $t$, the distance between any infected node and the
actual source is no more than $t$, which implies that the
eccentricity of a Jordan infection center is $\le t$.
Now consider a node $v$ at level $\ge (\tau + 1)l - 1$. Recall that at
least two $B^1$ branching processes starting from level $l$ survive.
Let $u_1 \in \mathcal{Z}_l^\tau$ be the root of a surviving $B^1$ branching process,
and assume node $v$ is not on the subtree rooted at $u_1$. Further,
assume $v'$ is an infected node at the lowest level of the subtree
rooted at $u_1$. Since the branching process survives, the
infection process propagates one level lower at each time slot
and node $v'$ is at level $l + t - t_{u_1}^I$.

From Figure 5, it is easy to see that the distance between
$v'$ and $v$ is at least

$$t - t_{u_1}^I + 2 + (\tau + 1)l - 1 - l = t - t_{u_1}^I + \tau l + 1,$$

which occurs when the first common predecessor of nodes $v'$
and $v$ is at level $l - 1$. Note that the common predecessor cannot
appear at level $\ge l$ since $v$ is not on the subtree rooted at
$u_1$. Since $u_1 \in \mathcal{Z}_l^\tau$, the infection time of node $u_1$ is no
later than $\tau l$, i.e., $t_{u_1}^I \le \tau l$. Therefore, the distance
between $v'$ and $v$ is at least $t + 1$, which is larger than $t$.
Hence, $v$ cannot be a Jordan infection center. Since $l \le L$,
any node at or below level $(\tau + 1)L - 1$ cannot be a Jordan
infection center. In summary, if event 2 occurs, then we have

$$d(v^*, v^{\dagger}) \le (\tau + 1)L - 1.$$

We next show that given any $\epsilon$, we can find sufficiently
large $\tau$ and $L$, independent of $t$ and the number of infected
nodes, such that the probability that either event 1 or event 2
occurs is at least $1 - \epsilon$.

Given $n_0 > 0$ and $\tau > 0$, we define

$$l^{\dagger} = \min\{l : Z_l^\tau \ge n_0\},$$

i.e., $l^{\dagger}$ is the first level at which $B^\tau$ has at least $n_0$ nodes.
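We first have

$$\begin{aligned}
&\Pr(Z_L = 0) + \Pr(K_l^\tau \ge 2 \text{ for some } l \le L) \\
&\ge \Pr(Z_L = 0) + \Pr(K_{l^{\dagger}}^\tau \ge 2 \text{ and } l^{\dagger} \le L) \\
&= \Pr(Z_L = 0) + \Pr(l^{\dagger} \le L) \Pr(K_{l^{\dagger}}^\tau \ge 2 \mid l^{\dagger} \le L) \\
&= \Pr(Z_L = 0) + \Pr\Big(\bigcup_{l=1}^{L} \{Z_l^\tau \ge n_0\}\Big) \Pr(K_{l^{\dagger}}^\tau \ge 2 \mid l^{\dagger} \le L) \\
&\ge \Big(1 - \Pr\Big(\bigcap_{l=1}^{L} \{0 < Z_l^\tau < n_0\}\Big) - \Pr\Big(\bigcup_{l=1}^{L} \{Z_l^\tau = 0\}\Big)\Big) \Pr(K_{l^{\dagger}}^\tau \ge 2 \mid l^{\dagger} \le L) + \Pr(Z_L = 0).
\end{aligned}$$

Note that we have

$$\Pr(K_{l^{\dagger}}^\tau \ge 2 \mid l^{\dagger} \le L) = \sum_{l=1}^{L} \Pr(K_l^\tau \ge 2 \mid l^{\dagger} = l) \Pr(l^{\dagger} = l \mid l^{\dagger} \le L). \tag{13}$$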



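According to Lemma 6, given any $\epsilon_1 > 0$, we can find a
sufficiently large $n_0$ such that

$$\Pr(K_l^\tau \ge 2 \mid l^{\dagger} = l) \ge 1 - \epsilon_1,$$

which implies that for sufficiently large $n_0$,

$$\Pr(K_{l^{\dagger}}^\tau \ge 2 \mid l^{\dagger} \le L) \ge 1 - \epsilon_1.$$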

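We can then conclude

$$\begin{aligned}
&\Pr(Z_L = 0) + \Pr(K_l^\tau \ge 2 \text{ for some } l \le L) \\
&\ge \Big(1 - \Pr\Big(\bigcap_{l=1}^{L} \{0 < Z_l^\tau < n_0\}\Big)\Big)(1 - \epsilon_1) - \Pr\Big(\bigcup_{l=1}^{L} \{Z_l^\tau = 0\}\Big) + \Pr(Z_L = 0) \\
&= \Big(1 - \Pr\Big(\bigcap_{l=1}^{L} \{0 < Z_l^\tau < n_0\}\Big)\Big)(1 - \epsilon_1) + \Pr(Z_L = 0) - \Pr(Z_L^\tau = 0),
\end{aligned}$$

where $\Pr(\bigcup_{l=1}^{L} \{Z_l^\tau = 0\}) = \Pr(Z_L^\tau = 0)$ because $Z_l^\tau = 0$
implies that $Z_L^\tau = 0$ for $l \le L$.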

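According to Lemma 7 and Lemma 8, given any $\epsilon_2 > 0$
and $\epsilon_3 > 0$, there exist sufficiently large $\tau$ and $L$ such that

$$1 - \Pr\Big(\bigcap_{l=1}^{L} \{0 < Z_l^\tau < n_0\}\Big) \ge 1 - \epsilon_2$$

and

$$\Pr(Z_L = 0) - \Pr(Z_L^\tau = 0) \ge -\epsilon_3.$$

Hence, we have

$$\Pr(Z_L = 0) + \Pr(K_l^\tau \ge 2 \text{ for some } l \le L) \ge (1 - \epsilon_1)(1 - \epsilon_2) - \epsilon_3.$$

Now choosing $\epsilon_1 = \epsilon_2 = \epsilon_3 = \epsilon_4/3$ for some $\epsilon_4 > 0$, we have

$$\Pr(Z_L = 0) + \Pr(K_l^\tau \ge 2 \text{ for some } l \le L) \ge 1 - \epsilon_4.$$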
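For completeness, the last step can be checked by direct expansion; the display below is a worked verification added here, not part of the original derivation.

```latex
\[
(1-\epsilon_1)(1-\epsilon_2) - \epsilon_3
  = \Bigl(1 - \tfrac{\epsilon_4}{3}\Bigr)^{2} - \tfrac{\epsilon_4}{3}
  = 1 - \epsilon_4 + \tfrac{\epsilon_4^{2}}{9}
  \;\ge\; 1 - \epsilon_4 .
\]
```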

Now let $|Y|$ denote the number of infected nodes in the
observation Y. Define events $E_1 = \{Z_L = 0\}$ and
$E_2 = \{K_l^\tau \ge 2 \text{ for some } l \le L\}$. We have

$$\Pr(E_1 \mid |Y| \ge 1) + \Pr(E_2 \mid |Y| \ge 1) = \frac{1}{\Pr(|Y| \ge 1)} \Big(\Pr(E_1 \cap \{|Y| \ge 1\}) + \Pr(E_2 \cap \{|Y| \ge 1\})\Big).$$

Since $E_2$ implies that $|Y| \ge 1$, we have

$$\begin{aligned}
&\Pr(E_1 \mid |Y| \ge 1) + \Pr(E_2 \mid |Y| \ge 1) \\
&= \frac{1}{\Pr(|Y| \ge 1)} \Big(\Pr(E_1 \cap \{|Y| \ge 1\}) + \Pr(E_2)\Big) \\
&= \frac{1}{\Pr(|Y| \ge 1)} \Big(\Pr(E_1) - \Pr(E_1 \cap \{|Y| = 0\}) + \Pr(E_2)\Big) \\
&\ge \frac{1}{\Pr(|Y| \ge 1)} \Big(\Pr(E_1) - \Pr(\{|Y| = 0\}) + \Pr(E_2)\Big) \\
&\ge \frac{1}{\Pr(|Y| \ge 1)} \Big(\Pr(\{|Y| \ge 1\}) - \epsilon_4\Big) \\
&= 1 - \frac{\epsilon_4}{\Pr(|Y| \ge 1)}.
\end{aligned}$$

Note that $\Pr(|Y| \ge 1)$ is a positive constant since the
$B^1$ branching process starting from the information source
survives with non-zero probability. The theorem holds by
choosing $\epsilon_4 = \epsilon \Pr(|Y| \ge 1)$. ■

Lemma 6. Consider $n_0$ i.i.d. GW branching processes with a
binomial offspring distribution with parameters $g$ and $q$ such
that $gq > 1$. Denote by $K$ the number of branching processes
that survive. Given any $\epsilon > 0$, if

$$n_0 \ge \frac{8 \log(1/\epsilon)}{1 - \rho},$$

then

$$\Pr(K \ge 2) \ge 1 - \epsilon,$$

where $\rho$ is the extinction probability of the GW branching
process. In the binomial case, $\rho$ is the smallest nonnegative
root of the equation $\rho = (1 - q + q\rho)^g$. □

Proof: The extinction probability of a GW branching
process is denoted by $\rho$, which is the smallest nonnegative
root of the equation $\rho = G(\rho)$ according to [18], where $G(\cdot)$
is the probability generating function of the offspring
distribution. In the binomial case we have
$G(\rho) = (1 - q + q\rho)^g$, and $\rho < 1$ when $gq > 1$.

We define a Bernoulli random variable $H_i$ for the $i$th
branching process such that

$$H_i = \begin{cases} 1, & \text{if the $i$th branching process survives;} \\ 0, & \text{otherwise.} \end{cases}$$

So $K = \sum_{i=1}^{n_0} H_i$, and

$$E[K] = n_0 (1 - \rho).$$

According to the Chernoff bound [19], we have

$$\Pr\big(K \le (1 - \delta)(1 - \rho) n_0\big) \le e^{-\delta^2 (1 - \rho) n_0 / 2}.$$

Choose $\delta = 0.5$. The lemma holds if

$$(1 - \rho) n_0 / 2 \ge 2$$

and

$$(1 - \rho) n_0 / 8 \ge \log(1/\epsilon).$$
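To give a feel for the sizes involved, the sketch below computes the extinction probability of the binomial branching process by fixed-point iteration from 0, which increases monotonically to the smallest nonnegative root, and then a value of $n_0$ that satisfies both conditions used at the end of the proof. The function names, tolerance, and sample parameters are assumptions made for illustration.

```python
from math import ceil, log

def extinction_probability(g, q, tol=1e-12, max_iter=100000):
    """Smallest nonnegative root of rho = (1 - q + q * rho) ** g."""
    rho = 0.0
    for _ in range(max_iter):
        nxt = (1 - q + q * rho) ** g
        if abs(nxt - rho) < tol:
            return nxt
        rho = nxt
    return rho

def n0_bound(g, q, eps):
    """Smallest integer n0 with (1-rho)*n0/2 >= 2 and (1-rho)*n0/8 >= log(1/eps)."""
    rho = extinction_probability(g, q)
    return ceil(max(8 * log(1 / eps), 4) / (1 - rho))

# Hypothetical parameters: g = 2 children and infection probability q = 0.75,
# so gq = 1.5 > 1 and the smallest root of the fixed-point equation is 1/9.
rho = extinction_probability(2, 0.75)
print(rho, n0_bound(2, 0.75, 0.01))
```

When $gq \le 1$ the iteration converges to 1 and the bound is undefined, which matches the requirement $gq > 1$ in the lemma.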



Lemma 7. Given any $\epsilon > 0$, there exists a constant $L'$ such
that for any $L \ge L'$,

$$\Pr\Big(\bigcap_{l=1}^{L} \{0 < Z_l^\tau < n_0\}\Big) \le \epsilon.$$

□



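Proof: Define $p_\tau$ to be the probability that a node infects
at least one of its children within $\tau$ time slots of being
infected. We have

$$p_\tau = \sum_{t=1}^{\tau-1} (1-p)^{t-1} p \left(1 - (1-q)^{gt}\right) + (1-p)^{\tau-1} \left(1 - (1-q)^{g\tau}\right),$$

and

$$\Pr(0 < Z_l^\tau < n_0 \mid 0 < Z_{l-1}^\tau < n_0) \le \Pr(Z_l^\tau > 0 \mid 0 < Z_{l-1}^\tau < n_0) \le 1 - (1 - p_\tau)^{n_0},$$

which implies that

$$\begin{aligned}
\Pr\Big(\bigcap_{l=1}^{L} \{0 < Z_l^\tau < n_0\}\Big)
&= \Pr(0 < Z_L^\tau < n_0 \mid 0 < Z_{L-1}^\tau < n_0) \\
&\quad \times \Pr(0 < Z_{L-1}^\tau < n_0 \mid 0 < Z_{L-2}^\tau < n_0) \cdots \\
&\quad \times \Pr(0 < Z_2^\tau < n_0 \mid 0 < Z_1^\tau < n_0) \Pr(0 < Z_1^\tau < n_0) \\
&\le \left(1 - (1 - p_\tau)^{n_0}\right)^L.
\end{aligned}$$

The lemma holds by choosing

$$L' = \frac{\log \epsilon}{\log\left(1 - (1 - p_\tau)^{n_0}\right)}.$$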
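The threshold $L'$ can be evaluated directly from the expression for $p_\tau$. The sketch below is a numerical illustration under the same formula; the function names and the sample parameters are hypothetical, and $L'$ is rounded up to an integer number of levels.

```python
from math import ceil, log

def p_tau(g, p, q, tau):
    """Probability that a node infects at least one of its g children within tau slots."""
    total = sum((1 - p) ** (t - 1) * p * (1 - (1 - q) ** (g * t))
                for t in range(1, tau))
    return total + (1 - p) ** (tau - 1) * (1 - (1 - q) ** (g * tau))

def level_threshold(g, p, q, tau, n0, eps):
    """Smallest integer L with (1 - (1 - p_tau)**n0) ** L <= eps."""
    stay = 1 - (1 - p_tau(g, p, q, tau)) ** n0  # per-level bound from the proof
    return ceil(log(eps) / log(stay))

# Hypothetical parameters: g = 2, p = 0.3, q = 0.5, tau = 1, n0 = 2, eps = 0.01.
print(p_tau(2, 0.3, 0.5, 1), level_threshold(2, 0.3, 0.5, 1, 2, 0.01))
```

The threshold grows quickly with $n_0$, since the per-level bound $1 - (1 - p_\tau)^{n_0}$ approaches 1.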



Lemma 8. Given any $\epsilon$, there exist $\tau'$ and $L'$ such that for
any $\tau \ge \tau'$ and $L \ge L'$,

$$\Pr(Z_L = 0) - \Pr(Z_L^\tau = 0) \ge -\epsilon.$$

□



Proof: Note that the difference can be rewritten as

$$\begin{aligned}
\Pr(Z_L = 0) - \Pr(Z_L^\tau = 0)
&= \big(\Pr(Z_L = 0) - \Pr(Z_\infty = 0)\big) + \big(\Pr(Z_\infty = 0) - \Pr(Z_\infty^\tau = 0)\big) \\
&\quad + \big(\Pr(Z_\infty^\tau = 0) - \Pr(Z_L^\tau = 0)\big).
\end{aligned}$$

Step 1: Since $\{Z_L^\tau = 0\} \subset \{Z_\infty^\tau = 0\}$,

$$\Pr(Z_\infty^\tau = 0) - \Pr(Z_L^\tau = 0) \ge 0.$$

Step 2: We know

$$\lim_{L \to \infty} \Pr(Z_L = 0) = \Pr(Z_\infty = 0).$$

Then for $\epsilon/2 > 0$, there exists $L'$ such that for any $L \ge L'$,

$$|\Pr(Z_L = 0) - \Pr(Z_\infty = 0)| \le \epsilon/2,$$

which implies that

$$\Pr(Z_L = 0) - \Pr(Z_\infty = 0) \ge -\epsilon/2.$$

Step 3: In this step, we will show

$$\lim_{\tau \to \infty} \Pr(Z_\infty^\tau = 0) = \Pr(Z_\infty = 0).$$

Define the generating functions of the offspring distributions of
$B^\tau$ and $B^\infty$ to be $G^\tau(s)$ and $G(s)$, respectively. We know that
$G^\tau(s) - s$ and $G(s) - s$ are convex functions when $s \in [0, 1]$.
Let $\rho = \Pr(Z_\infty = 0)$, i.e., the extinction probability. We know
that $\rho$ is the smallest nonnegative root of $G(\rho) = \rho$ and $\rho < 1$.

Similarly, define $\rho_\tau = \Pr(Z_\infty^\tau = 0)$ and $\rho' = \lim_{\tau \to \infty} \rho_\tau$.
Taking the limit on both sides of $G^\tau(\rho_\tau) = \rho_\tau$, we have

$$G(\rho') = \rho'.$$

Note that

$$\rho \le \rho_\tau \le \rho_1 < 1$$

for any $\tau$, so

$$\rho \le \rho' \le \rho_1 < 1.$$

Since $G(s) - s = 0$ has at most two solutions in $[0, 1]$ and
$s = 1$ is one of them, we conclude $\rho' = \rho$. Therefore, for
given $\epsilon/2 > 0$, there exists $\tau'$ such that for any $\tau \ge \tau'$,

$$\Pr(Z_\infty = 0) - \Pr(Z_\infty^\tau = 0) \ge -\epsilon/2.$$

Hence, the lemma holds. ■

VI. Simulations 

In this section, we evaluate the performance of the reverse 
infection algorithm on different networks, including different 
tree networks and some real world networks. 

A. Tree Networks 

In this section, we evaluate the performance of the reverse 
infection algorithm on tree networks. We compare the reverse 
infection algorithm with the closeness centrality heuristic, 
which selects the node with the maximum infection close- 
ness as the information source. Note that the node with the 
maximum closeness is the maximum likelihood estimator of 
the information source on regular trees under the SI model 
0-0. 

1) Small-size tree networks: We first studied the
performance on small-size trees. The infection probability $q$
was chosen uniformly from (0, 1) and the recovery probability
$p$ was chosen uniformly from (0, q). The infection process
propagates for $t$ time slots, where $t$ was uniformly chosen from
[3, 5]. To keep the size of the infection topology small, we
restricted the total number of infected and recovered nodes to
be no more than 100. For small-size trees, we first calculated
the MLE using dynamic programming for fixed $t$ and then
searched over $t \in [0, t_{\max}]$ for a large value of $t_{\max}$ to find
the optimal estimator.

The detection rate is defined to be the fraction of
experiments in which the estimator coincides with the actual
source. We varied $g$ from 2 to 10 and the results are shown in
Figure 6. We can see that the detection rate of the reverse
infection algorithm is almost the same as that of the MLE, and
is higher than that of the closeness centrality heuristic by
approximately 20% when the degree is small and by 10% when the
degree is large.

Figure 6. The Detection Rates of the Maximum Likelihood Estimator (MLE),
Reverse Infection (RI) and Closeness Centrality (CC) on Regular Trees

Figure 7. The Detection Rates of the Reverse Infection (RI) and Closeness
Centrality (CC) Algorithms on Regular Trees

2) General g-regular tree networks: We further conducted
our simulations on large-size g-regular trees. The infection
probability $q$ was chosen uniformly from (0, 1) and the
recovery probability $p$ was chosen uniformly from (0, q).
The infection process propagates for $t$ time slots, where $t$ was
uniformly chosen from [3, 20]. We selected the networks in
which the total number of infected and recovered nodes is no
more than 500.
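A snapshot for such an experiment could be generated along the following lines. This is a minimal sketch under assumptions that are not stated in the setup above: the tree is built lazily with nodes encoded as tuples of child indices from the source, the source has $g + 1$ children and every other node $g$, and within a time slot infection attempts are made before recovery is drawn. The function name is hypothetical.

```python
import random

def simulate_sir_on_regular_tree(g, q, p, t, rng):
    """Run t slots of discrete-time SIR from the root; return (infected, recovered)."""
    source = ()                      # the root is the empty path
    infected = {source}
    recovered = set()
    for _ in range(t):
        newly = set()
        for node in sorted(infected):            # sorted for reproducibility
            n_children = g + 1 if node == source else g
            for k in range(n_children):
                child = node + (k,)
                # On a tree rooted at the source, only children can still be susceptible.
                if child in infected or child in recovered or child in newly:
                    continue
                if rng.random() < q:
                    newly.add(child)
        # Assumption: nodes infected before this slot may recover at its end.
        healed = {node for node in sorted(infected) if rng.random() < p}
        infected = (infected - healed) | newly
        recovered |= healed
    return infected, recovered

# Hypothetical run: g = 3, q = 0.4, p = 0.2, observed after t = 5 slots.
rng = random.Random(7)
infected, recovered = simulate_sir_on_regular_tree(3, 0.4, 0.2, 5, rng)
print(len(infected), len(recovered))
```

In an experiment of the kind described here, a run would be kept only if `len(infected) + len(recovered)` stays within the stated size limit, and the snapshot passed to the estimator would contain the infected set only.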

We varied $g$ from 2 to 10. Figure 7 shows the detection rate
as a function of $g$. We can see that the detection rates of both
the reverse infection and closeness centrality algorithms
increase as the degree increases and are higher than 60% when
$g > 6$. However, the detection rate of the reverse infection
algorithm is higher than that of the closeness centrality
algorithm, and the average difference is 8.86%.

3) Binomial random trees: In addition, we evaluated the
performance on binomial random trees, where the number of
children of each node follows a binomial distribution with
number of trials $g'$ and success probability $\beta$. We fixed
$g' = 10$ and varied $\beta$ from 0.1 to 0.9. The durations of the
infection process and the observed infected networks were
selected according to the same rules as in the g-regular tree
case. The results are shown in Figure 8. Similar to the regular
tree case, as $\beta$ increases, the tree becomes denser, which
increases the number of surviving branching processes and the
detection rate. The reverse infection algorithm outperforms
the closeness centrality algorithm by 10.16% on average.

Figure 8. The Detection Rates of the Reverse Infection (RI) and Closeness
Centrality (CC) Algorithms on Binomial Random Trees

Figure 9. The Performance of the Reverse Infection (RI) on the Internet
Autonomous Systems Network

B. Real World Networks 

We next conducted experiments on three real world networks:
the Internet Autonomous Systems network (IAS), the Wikipedia
who-votes-on-whom network (Wikipedia) 4, and the power grid
network (PG). We compare the reverse infection algorithm with
random guessing, which randomly selects a node and declares it
as the information source. In these networks, the infection
probability $q$ was chosen uniformly from (0, 0.05) and the
recovery probability $p$ was chosen uniformly from (0, q). Here
we chose small infection probabilities since the networks are
of finite size, so the infection process should be controlled to
make sure that not all nodes were infected when the network was
observed. The duration $t$ was an integer uniformly chosen from
[3, 200]. We selected the networks in which the total number of
infected and recovered nodes was in the range of [50, 500].

3 Available at http://snap.stanford.edu/data/index.html
4 Available at http://www-personal.umich.edu/~mejn/netdata/

Figure 10. The Performance of the Reverse Infection (RI) on the Wikipedia
Who-Votes-on-Whom Network

Figure 11. The Performance of the Reverse Infection (RI) on the Power
Grid Network

1) The Internet autonomous systems network: Figure 9
shows the results on the Internet autonomous systems
network. An Internet autonomous system is a collection of
connected routers that use a common routing policy. The
Internet autonomous systems network is obtained from
the recorded communication between the Internet autonomous
systems inferred from Oregon route-views on March 31,
2001. The network consists of 10,670 nodes and 22,002 edges.
According to Figure 9, more than 80% of the estimators
identified by the reverse infection algorithm are no more than
two hops away from the actual sources, compared with 10%
under random guessing.

2) The Wikipedia who-votes-on-whom network: Figure 10
shows the results on the Wikipedia who-votes-on-whom network,
in which two nodes are connected if one user voted on the
other in the administrator promotion elections. The network
has 100,736 links and 7,066 nodes. We make similar
observations as for the Internet autonomous systems network:
the majority of the estimators produced by the reverse
infection algorithm are no more than two hops away from the
actual sources, whereas fewer than 20% of the estimators of
random guessing are within two hops of the actual sources.

3) The power grid network: Figure 11 shows the results
on the power grid, which has 4,941 nodes and 6,594 edges.
As we can see, the reverse infection algorithm performs better
than random guessing. The peak of the reverse infection
algorithm appears at the third hop versus the seventeenth hop
under random guessing.



VII. Conclusion 

In this paper, we developed a sample path based approach to 
find the information source under the SIR model. We proved 
that the sample path based estimator is a node with the mini- 
mum infection eccentricity. Based on that, a reverse infection 
algorithm has been proposed. We analyzed the performance of 
the reverse infection algorithm on regular trees, and showed 
that with a high probability the distance between the estimator 
and actual source is a constant, independent of the number of 
infected nodes and the time the network was observed. We 
evaluated the performance of the proposed reverse infection 
algorithm on several different network topologies. 

Appendix A 
Notation Table 



Symbol                  Definition

$q$                     the probability an infected node infects its neighbors
$p$                     the probability an infected node recovers
$v^*$                   the actual information source
$v^{\dagger}$           the estimator of the information source
$d(v, u)$               the length of the shortest path between node $v$ and node $u$
(symbol illegible)      the set of children of node $u$
$e(v)$                  the infection eccentricity of node $v$
$t_v^*$                 the time duration associated with the optimal sample path in which node $v$ is the information source
$t_v^I$                 the infection time of node $v$
$t_v^R$                 the recovery time of node $v$
Y                       the snapshot of all nodes
$T_u^{-v}$              the tree rooted at node $u$ but without the branch from $v$
$X_v(t)$                the state of node $v$ at time $t$
$X(t)$                  the states of all nodes at time $t$
$X[0, t]$               the sample path from 0 to $t$
$X([0, t], T_u^{-v})$   the sample path from time slot 0 to $t$ restricted to $T_u^{-v}$
$\mathcal{X}(t)$        the set of all valid sample paths from time slot 0 to $t$
$\mathcal{X}(t, T_u^{-v})$  the set of all valid sample paths from time slot 0 to $t$ restricted to $T_u^{-v}$
$\mathcal{I}$           the set of the infected nodes